Representing OCRed documents in HTML
Authors
Abstract
OCR is an error-prone process, and manually proofreading OCR results is time-consuming and expensive. The errors remaining in OCRed texts can cause serious problems in reading and understanding if readers cannot refer to the original image representation. As demonstrated in this paper, a hybrid document that combines symbolic representation and image representation may relieve the problem. If an OCRed document is represented properly in HTML, OCR errors will not have much negative effect on the human reading process in an HTML browser and can be corrected with an HTML authoring tool. An experiment that uses this approach to evaluate a Japanese OCR system developed at CEDAR is also reported in this paper.

1 Overview of the Approach

OCR is a process that transforms a given document from its image representation into its symbolic representation. After this process, we obtain a text document that is electronically searchable, indexable and reusable. However, the transformation is error-prone, especially when the image quality is poor. Therefore, if we save only the pure text generated by the OCR process, the text may contain errors. Although errors in an OCRed text may not cause problems in tasks such as text categorization and retrieval, they may have a negative effect if the text is to be read by a human being.

In a current OCR system, although the final output is usually just the symbolic content, detailed information on segmentation and recognition is stored as an intermediate step. This information usually includes: (1) coordinates of the bounding boxes of text blocks, lines, words and characters; (2) alternatives and confidence scores of recognition for each character/word image. Based on this information, many potential OCR errors or suspicious points can be marked. Traditionally, an OCR system provides a specially designed graphical user interface (GUI) in which an end user proofreads the OCRed text and manually corrects those errors.
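The intermediate information described above (bounding boxes, recognition alternatives and confidence scores) can be pictured as a small record per word. The following is only a minimal sketch: the `WordResult` class, the `suspicious_words` helper, the 0.80 threshold and the sample values are all hypothetical and are not taken from any particular OCR system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordResult:
    """Hypothetical record of what an OCR engine keeps for one word image."""
    bbox: Tuple[int, int, int, int]        # (left, top, right, bottom) in page pixels
    candidates: List[Tuple[str, float]]    # (text, confidence) alternatives, best first

    @property
    def best(self) -> Tuple[str, float]:
        # The top-ranked alternative and its confidence score.
        return self.candidates[0]


def suspicious_words(words: List[WordResult], threshold: float = 0.80) -> List[int]:
    """Return indices of words whose best confidence is below the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    return [i for i, w in enumerate(words) if w.best[1] < threshold]


# Made-up sample data: the second word was recognized with low confidence.
words = [
    WordResult((10, 10, 60, 30), [("OCR", 0.98), ("0CR", 0.40)]),
    WordResult((70, 10, 150, 30), [("enor", 0.55), ("error", 0.52)]),
    WordResult((160, 10, 220, 30), [("prone", 0.91)]),
]

print(suspicious_words(words))  # → [1]
```

A proofreading interface could highlight exactly the indices returned here, which is the kind of marking of suspicious points mentioned above.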
Figure 1 shows an OCRed text generated by a commercial OCR system. It is very time-consuming and expensive to manually proofread OCR results if a large number of scanned documents have to be transformed into text. Therefore, for OCRed documents in large digital libraries, usually no proofreading is applied. When reading an OCRed text such as the one shown in Figure 1, people may feel frustrated because OCR errors prevent them from getting exact information from the document.

Although it was originally designed for presenting documents on the WWW, HTML is rapidly developing into the most popular document representation and a universal, platform-independent client interface for a broad range of applications running on Internet/Intranet and network-centric application servers. Because HTML allows images to be embedded inside normal text, an OCRed document can be represented in HTML as a combination of text and images. The point is to make potential OCR errors transparent to the reading process: for words that are recognized with high confidence, the recognition results are used; otherwise, the original word images are placed. Figure 2 shows the HTML file for the OCRed text in Figure 1.

Similarly, Figure 3 illustrates a small Japanese document image and its OCR result in HTML. Although there are more than 6,000 Kanji characters in the Japanese character set, the Japanese OCR system used here is designed to recognize about 3,000 JIS level 1 Kanji characters, which are often used in Japan. When characters from JIS level 2 are encountered, recognition results for those characters will always be incorrect. It is best to keep those characters as original images so that the text will still be

Figure 1: The OCR result of a text block generated by a commercial OCR system, TypeReader 3.0 from ExperVision Inc. In the graphical user interface for proofreading, the places where the system failed or had low confidence are highlighted. A user can correct those errors after examining the original image.
Figure 2: The HTML file generated from the OCRed text in Figure 1 is displayed in Netscape Navigator. It is a hybrid representation of text and images. Word images from the original document image are embedded when the OCR system failed or had low confidence in recognizing the word.
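The hybrid representation described above, with recognized text for high-confidence words and the original word image otherwise, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the function name `render_hybrid_html`, the dictionary keys, the 0.80 threshold and the image file names (`w0.gif`, `w1.gif`, assumed to be word images already cropped from the page) are all hypothetical.

```python
from html import escape


def render_hybrid_html(words, threshold=0.80):
    """Build one HTML paragraph that mixes recognized text and word images.

    Each word is a dict with hypothetical keys: "text", "confidence",
    "bbox" as (left, top, right, bottom), and "image" (path to a cropped
    word image). The threshold is an assumed value for illustration.
    """
    parts = []
    for word in words:
        if word["confidence"] >= threshold:
            # High confidence: trust the recognition result and emit plain text.
            parts.append(escape(word["text"]))
        else:
            # Low confidence: embed the original word image instead, keeping
            # the OCR guess as alt text and the bounding-box size as dimensions.
            left, top, right, bottom = word["bbox"]
            parts.append(
                '<img src="{src}" alt="{alt}" width="{w}" height="{h}">'.format(
                    src=escape(word["image"]),
                    alt=escape(word["text"]),
                    w=right - left,
                    h=bottom - top,
                )
            )
    # Words are joined with spaces, which suits space-delimited scripts only.
    return "<p>" + " ".join(parts) + "</p>"


# Made-up sample data: the second word falls below the threshold.
words = [
    {"text": "OCR", "confidence": 0.98, "bbox": (10, 10, 60, 30), "image": "w0.gif"},
    {"text": "enor", "confidence": 0.55, "bbox": (70, 10, 150, 30), "image": "w1.gif"},
]

print(render_hybrid_html(words))
# → <p>OCR <img src="w1.gif" alt="enor" width="80" height="20"></p>
```

Keeping the OCR guess in the `alt` attribute is a design choice of this sketch; it leaves the low-confidence guess available to text search while the reader sees the original image. For Japanese text, the parts would be joined without spaces.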